Pandas is a very popular Python package that provides data structures and data analysis tools. It includes tools for reading and writing various data formats, processing data sets in an efficient DataFrame object, and the ability to reshape, filter, index, and subset data easily.
This notebook is a quick introduction to some of the features of the Pandas package.
The primary data structure in the Pandas package is the DataFrame. A DataFrame is a table-like object that contains named columns of data. Pandas provides readers for many different file types including CSV, Excel, SAS, HTML, etc. Since CSV is such a common format, let's look at reading a CSV file into a Pandas DataFrame.
First, we need to import the Pandas package. A very common convention is to import Pandas using a shortcut named of 'pd'. This is done with the following line.
In [1]:
import pandas as pd
The function that reads CSV files into DataFrames is called read_csv
. In the simplest form, you just supply it with a filename or URL. We have the SAS cars data set stored in github that we can point to directly. We'll use the head
method to display just the first few records.
In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()
Out[2]:
There are many more Pandas data readers that you read about on the Pandas web site.
In [3]:
df.columns
Out[3]:
The data types of the columns can be displayed using the dtypes
property.
In [4]:
df.dtypes
Out[4]:
For general information about the DataFrame as a whole, you can use the info
method.
In [5]:
df.info()
Now that we know more about the columns and their data types, we can move on to subsetting DataFrames into other DataFrames or columns.
You can select columns from a DataFrame to create a DataFrame with fewer columns, or you can select a single column of data into a Series. The syntax used for this operation is Python's indexing syntax [...]
.
If you index into a DataFrame using just the name of a column, you will get a Series back. If you index into a DataFrame using a list of names, you will get a DataFrame back.
In [6]:
subdf = df[['Make', 'Model', 'Horsepower']]
subdf.head()
Out[6]:
In [7]:
model = df['Model']
model.head()
Out[7]:
In addition to selecting columns of data, DataFrames have more advanced ways of selecting data. We'll look at that in the next section.
Pandas DataFrames have an index associated both with rows and columns. We have seen the column index in the columns
property above. We have also indexed into that collection using the column names. You can also index rows by position and label using the iloc
and loc
properties, respectively.
Currently, our cars DataFrame only has numeric row indexes, it doesn't have labels, so we'll start with iloc
.
In [8]:
df.iloc[3:7]
Out[8]:
As you can see above, we selected the row indexs from 3 to 7. In Python, numeric slices such as this include the first index, but do not include the last index. That is why you see row indexes 3, 4, 5, and 6 only in the result.
You can also select rows using a list of row indexes.
In [9]:
df.iloc[[4, 5, 10]]
Out[9]:
Arguably a more useful way of indexing Pandas DataFrames is using row labels. We can set a column as a row index using the set_index
method. Then we can use those labels in the loc
property for easy data selection.
In [10]:
df = df.set_index('Model')
In [11]:
df.head()
Out[11]:
You'll see in the output above that the Model column is now displayed as a row index. We can use the values in that index in the loc
property. This includes both single values, lists of values, or slices.
In [12]:
df.loc['MDX']
Out[12]:
In [13]:
df.loc[['TSX 4dr', 'TL 4dr']]
Out[13]:
In [14]:
df.loc['MDX':'TL 4dr']
Out[14]:
Note that when using slicing on labels, the endpoint for the slice is included. This is different than for row indexes.
In [15]:
df.iloc[1:5, 3:6]
Out[15]:
In [16]:
df.loc['MDX':'TL 4dr', 'MSRP':'Horsepower']
Out[16]:
If you want to mix positional and label information between rows and columns, you need to use the ix
property.
In [17]:
df.ix[2:6, 'MSRP':'Cylinders']
Out[17]:
Of course, the ix
property also supports single values, lists of values, and slices for each component.
In [18]:
df.ix[[1,2,5], ['Make', 'MSRP', 'Horsepower']]
Out[18]:
A more dynamic way of indexing DataFrames is to use boolean indexing. Rather than specifying explicit values for the index, we will use an expression to indicate which rows we want to select. The expression creates a boolean Series which indicates the rows to keep. Here is an example of a boolean series created by comparing the MSRP column to 40000.
In [19]:
df['MSRP'] > 40000
Out[19]:
As you can see, the values of the Series are all True or False. If we use that condition as an index value of a DataFrame, we'll get a new DataFrame containing only the rows where the condition was true.
In [20]:
df[df['MSRP'] > 40000].head()
Out[20]:
You can combine conditions using &
for "and" and |
for "or". For example, if we wanted cars that had an MSRP greater than $40,000, but also had 8 or more Cylinders, we could do the following. Note that due to the order of precedence of comparison and combination operatiors, you need to surround your comparions with parentheses.
In [21]:
df[(df['MSRP'] > 40000) & (df['Cylinders'] > 8)].head()
Out[21]:
In [22]:
df.sort_index().head()
Out[22]:
Sorting by values is done by specify the columns that you want to sort by.
In [23]:
df.sort_values(['MSRP', 'Horsepower']).head()
Out[23]:
You can also use the ascending=
option to specify the order in which the columns should be sorted.
In [24]:
df.sort_values(['MSRP', 'Horsepower'], ascending=[False, True]).head()
Out[24]:
Keep in mind that the sort_index
and sort_values
methods return a new DataFrame with the sorting applied. If you want to sort a DataFrame in place, you need to specify the inplace=True
option.
Let's reset the index before moving on to the next sections.
In [25]:
df.reset_index(inplace=True)
In [26]:
df.describe()
Out[26]:
The describe
method has a couple of options to select different columns of data (include=
) and the percentiles to compute (percentiles=
). In the example below, we are selecting all of the columns and changing the displayed percentiles to 30% and 70% (50% is always displayed).
In [27]:
df.describe(include='all', percentiles=[0.3, 0.7])
Out[27]:
You can also select individual statistics using their own methods.
In [28]:
df.min()
Out[28]:
In [29]:
df.median()
Out[29]:
All of the above methods also work on individual columns, or subsets of columns.
In [30]:
df['MSRP'].describe()
Out[30]:
In [31]:
df['MSRP'].min()
Out[31]:
In [32]:
df[['MSRP', 'Horsepower']].describe()
Out[32]:
In [33]:
grpdf = df.groupby('Origin')
grpdf
Out[33]:
You'll notice that in this case the returned value is a DataFrameGroupBy
object. Many of the methods available on a DataFrame will also work on the DataFrameGroupBy
object.
In [34]:
grpdf.describe()
Out[34]:
In [35]:
grpdf[['MSRP', 'Horsepower']].describe(percentiles=[0.2, 0.8])
Out[35]:
It is also possible to loop through all of the groupings using iteration in Python.
In [36]:
for name, group in grpdf:
print('=' * 35, name, '=' * 35)
print('')
print(group.head())
print('')
There are far too many grouping features in Pandas to cover here. For more information, see the Pandas Group By documentation.
There are several packages for creating plots in Python. These include matplotlib, seaborn, bokeh, plot.ly, or even Pandas itself. Many of these packages such as seaborn and the Pandas plotting features use matplotlib in the background. Packages like bokeh and plot.ly are primarily focused on graphics that are rendered in a web browser.
The most basic plotting features can be accessed in the plot
method of the DataFrame. To start, we can create a scatter plot of the MSRP values.
In [37]:
%matplotlib inline
In [38]:
df.plot(kind='scatter', x='MSRP', y='Horsepower', figsize=(12,6))
Out[38]:
Using the cufflinks package, which is a DataFrame-style interface to plot.ly, we can create a graph that is rendered in the browser and includes interactive features.
In [39]:
import cufflinks
# Configure cufflinks to render the graph locally.
cufflinks.go_offline()
df.iplot(kind='scatter', x='MSRP', y='Horsepower', mode='markers',
size=8, dimensions=(750, 400))
For more information on plotting features of DataFrames, see the Pandas Visualization documentation.
We've just covered the very basics of the Pandas package here. You should have enough to get started, but for more information, you should see the official documentation.